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ABSTRACT 

Sparse matrix-vector multiplication (SpMV) is a fundamen¬ 
tal building block for numerous applications. In this paper, 
we propose CSR5 (Compressed Sparse Row 5), a new stor¬ 
age format, which offers high-throughput SpMV on various 
platforms including CPUs, GPUs and Xeon Phi. First, the 
CSR5 format is insensitive to the sparsity structure of the 
input matrix. Thus the single format can support an SpMV 
algorithm that is efficient both for regular matrices and for 
irregular matrices. Furthermore, we show that the overhead 
of the format conversion from the CSR to the CSR5 can be 
as low as the cost of a few SpMV operations. 

We compare the CSR5-based SpMV algorithrrQ with 11 
state-of-the-art formats and algorithms on four mainstream 
processors using 14 regular and 10 irregular matrices as a 
benchmark suite. For the 14 regular matrices in the suite, 
we achieve comparable or better performance over the pre¬ 
vious work. For the 10 irregular matrices, the CSR5 obtains 
average performance improvement of 17.6%, 28.5%, 173.0% 
and 293.3% (up to 213.3%, 153.6%, 405.1% and 943.3%) 
over the best existing work on dual-socket Intel CPUs, an 
nVidia GPU, an AMD GPU and an Intel Xeon Phi, respec¬ 
tively. For real-world applications such as a solver with only 
tens of iterations, the GSR5 format can be more practical 
because of its low-overhead for format conversion. 

Categories and Subject Descriptors 

G.1.3 [Numerical Linear Algebra]: Sparse, structured, 
and very large systems (direct and iterative methods); G.4 
[Mathematical Software]: Parallel and vector implemen¬ 
tations 

General Terms 

Algorithms, Experimentation, Performance 
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1. INTRODUCTION 

Over the past few decades, sparse matrix-vector multipli¬ 
cation (SpMV) has probably been the most studied sparse 
BLAS routine because of its importance in many scientific 
applications. The SpMV operation multiplies a sparse ma¬ 
trix A of size m X n by a dense vector x of size n and obtains a 
dense vector y of size m. Its naive sequential implementation 
can be very simple, and can be easily parallelized by adding 
a few pragma directives for the compilers. But to accelerate 
large-scale computation, parallel SpMV is still required to 
be hand-optimized with specific data storage formats and 
algorithms [Il[2|3l|l|6l[7l[T0l[T3l[Ta[T7l[T9l[2Tl[23[2l[29l 
150115111551 [M| . 

As a result, a conflict may emerge between the require¬ 
ments of SpMV and other sparse matrix operations such 
as preconditioning operations and sparse matrix-matrix 
multiplication [23]. The reason is that those operations com¬ 
monly require matrices stored in the basic formats such as 
the compressed sparse row (GSR). Therefore, when users 
construct a real-world application, they need to consider a 
cost of format conversion between the SpMV-oriented for¬ 
mats and the basic formats. Unfortunately, this conversion 
overhead may offset the benefits of using these specialized 
formats, in particular when only tens of iterations are needed 
in a solver. 

The conversion cost is mainly from the expensive structure- 
dependent parameter tuning of a storage format. For exam¬ 
ple, some block-based formats require Hnding a good 2D 
block size [SllZKTol EH El I34| . Moreover, some hybrid 
formats H HI may need completely different partitioning 
parameters for distinct input matrices. 

To avoid the format conversion overhead, a few algorithms 
have concentrated on accelerating CSR-based SpMV with ei¬ 
ther row block methods [T][T7| or segmented sum methods (5) 
However, each of the two types of methods has its own 
drawbacks. As for the row block methods, despite their good 
performance for regular matrices, they may provide very low 
performance for irregular matrices due to unavoidable load 
imbalance. In contrast, the segmented sum methods can 
achieve near perfect load balance, but suffer from high over¬ 
head due to more global synchronizations and global mem¬ 
ory accesses. Furthermore, none of the above work can avoid 
an overhead from preprocessing, since certain auxiliary data 



for the basic CSR format have to be generated for better 
load balancing [UE] or established primitives [3 [16]. 

Therefore, to be practical, an efficient format must satisfy 
two criteria: (1) it should limit format conversion cost by 
avoiding structure-dependent parameter tuning, and (2) it 
should support fast SpMV for both regular and irregular 
matrices. 

To meet these two criteria, in thi^aper, we have designed 
CSR5 (Compressed Sparse Row 5]_|, a new format directly 
extending the classic CSR format. The CSR5 format leaves 
one of the three arrays of the CSR format unchanged, stores 
the other two arrays in an in-place tile-transposed order, and 
adds two groups of extra auxiliary information. The format 
conversion from the CSR to the CSR5 merely needs two tun¬ 
ing parameters: one is hardware-dependent and the other 
is sparsity-dependent (but structure-independent). Because 
the added two groups of information are usually much shorter 
than the original three in the CSR format, very limited ex¬ 
tra space is required. Furthermore, the CSR5 format is 
SIMD-friendly and thus can be easily implemented on all 
mainstream processors with the SIMD units. Because of 
the structure-independence and the SIMD utilization, the 
CSR5-based SpMV algorithm can bring stable high through¬ 
put for both regular and irregular matrices. 

In this paper, we make the following contributions: 

• We propose CSR5, an efficient storage format with low 
conversion cost and high degree of parallelism. 

• We present a CSR5-based SpMV algorithm based on 
a redesigned low-overhead segmented sum algorithm. 


is the number of rows of the matrix, (2) col_idx array of 
size nnz stores column indices of the nonzeros, where nnz 
is the number of nonzeros of the matrix, and (3) val array 
of size nnz stores values of the nonzeros. Figure [T] shows an 
example. 
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Figure 1: A sparse matrix and its CSR format. 


2.2 Parallel Algorithms for CSR-based SpMV 

2.2.1 Row Block Methods 

In a given sparse matrix, rows are independent from each 
other. Therefore an SpMV operation can be parallelized 
on decomposed row blocks. A logical processing unit is re¬ 
sponsible for a row block and stores dot product results of 
the matrix rows with the vector x to corresponding loca¬ 
tions in the result y. When the SIMD units of a physical 
processing unit are available, the SIMD reduction sum oper¬ 
ation can be utilized for higher efficiency. These two meth¬ 
ods are respectively known as the CSR-scalar and the CSR- 
vector algorithms, and have been implemented on CPUs | 33| 
and GPUs |4l[29]. Algorithm [T] shows a parallel CSR-scalar 
method. 


• We implement the work on four mainstream devices: 
CPU, nVidia GPU, AMD GPU and Intel Xeon Phi. 

• We evaluate the CSR5 format in both isolated SpMV 
tests and iteration-based scenarios. 

We compare the CSR5 with 11 state-of-the-art formats 
and algorithms on dual-socket Intel CPUs, an nVidia GPU, 
an AMD GPU and an Intel Xeon Phi. By using 14 regular 
and 10 irregular matrices as a benchmark suite, we show 
that the CSR5 obtains comparable or better performance 
over the previous work for the regular matrices, and can 
greatly outperform the prior work for the irregular matri¬ 
ces. As for the 10 irregular matrices, the CSR5 obtains 
average performance improvement of 17.6%, 28.5%, 173.0% 
and 293.3% (up to 213.3%, 153.6%, 405.1% and 943.3%) 
over the second best work on the four platforms, respec¬ 
tively. Moreover, for iteration-based real-world scenarios, 
the CSR5 format achieves higher speedups because of the 
fast format conversion. To the best of our knowledge, this 
is the first time that a single storage format can outper¬ 
form state-of-the-art work on all four modern multicore and 
many core processors. 


Algorithm 1 SpMV using the CSR-scalar method. 
1: for i = 0 to m — 1 in parallel do 
2: y [i] ■<— 0 

3: for j = row_ptr[i] to row_ptr[i -f 1]—1 do 

4: y[i] ■<—y[i] -|-val[j] X x [col_idx [j] ] 

5: end for 

6: end for 


Despite the good parallelism, exploiting the scalability 
in modern multi-processors is not trivial for the row block 
methods. The performance problems mainly come from load 
imbalance for matrices which consist of rows with uneven 
lengths. Specifically, if one single row of a matrix is signif¬ 
icantly longer than the other rows, only a single core can 
be fully used while the other cores in the same chip may be 
completely idle. Although various strategies, such as data 
streaming |111117| , memory coalescing m, data reordering 
or reconstruction [31 [iSl |25] , static or dynamic binning [T] 
I17| and Dynamic Parallelism [I], have been developed, none 
of those can fundamentally solve the problem of load imbal¬ 
ance, and thus provide relatively low SpMV performance for 
the GSR format. 


2. PRELIMINARIES 
2.1 The CSR Format 

The CSR format for sparse matrices consists of three ar¬ 
rays: (1) row_ptr array which saves the start and end point¬ 
ers of the nonzeros of the rows. It has size m -I- 1, where m 

^The reason we call the storage format CSR5 is that it has 
five groups of data, instead of three in the classic CSR. 


2.2.2 Segmented Sum Methods 

Blelloch et al. [5] pointed out that the segmented sum 
may be more attractive for the CSR-based SpMV, since it 
is SIMD friendly and insensitive to the sparsity structure of 
the input matrix, thus overcoming the shortcomings of the 
row block methods. 

Segmented sum (which is a special case of the backward 
segmented scan) performs a reduction sum operation for the 
entries in each segment in an array. A segment has its first 










entry flagged as TRUE and the other entries flagged as FALSE. 
Algorithm [2] lists a serial segmented sum algorithm. Vector¬ 
ized parallel segmented sum algorithms can be found in [51 

[mill]. 


Algorithm 2 Serial segmented sum operation. 

1: function SEGMENTED_SUM(*in, *flag) 

2: length •<— SlZEOF(*in) 

3: for i = 0 to length — 1 do 

4: if flag[i] = TRUE then 

5: j i — 2 -h 1 

6: while flag[j] = FALSE && j < length do 

7: in[i] in[i] + in[j] 

8: J ^ j + 1 

9: end while 

10: else 

11: in[i] t— 0 

12: end if 

13: end for 

14: end function 


In the SpMV operation, the segmented sum treats each 
matrix row as a segment and calculates a partial sum for 
the entry-wise products generated in each row. The SpMV 
operation using the segmented sum methods consists of four 
steps: (1) generating an auxiliary bit_f lag array of size nnz 
from the row_ptr array. An entry in bit_flag is flagged 
as TRUE if its location matches the first nonzero entry of 
a row, otherwise it is flagged as FALSE, (2) calculating all 
intermediate entries (i.e., entry-wise products) to an array 
of size nnz, (3) executing the parallel segmented sum for the 
array, and (4) collecting all partial sums to the result vector 
J/ if a row is not empty. Algorithm O lists the pseudocode. 
Figure [2] illustrates an example using the matrix A plotted 
in Figure 1. 


Algorithm 3 Segmented sum method CSR-based SpMV. 
1: MALLOC(*bit_flag, nnz) 

2: MEMSET(*bit_flag, FALSE) 

3: for i = 0 to m — 1 in parallel do t> Step 1 

4: bit_f lag [row_ptr [i] ] <— TRUE 

5: end for 

6: MALLOC(*product, nnz) 

7: for j — 0 to nnz — 1 in parallel do > Step 2 

8: product [j] <— val [j] X x [col_idx [j] ] 

9: end for 

10: SEGMENTED_SUM(*product, *bit_flag) 0 Step 3 

11: for fc = 0 to m — 1 in parallel do > Step 4 

12: if row_ptr [/c] = row_ptr [fc + 1] then 

13: y[fc] ^ 0 

14: else 

15: y[fe] ■<—product [row_ptr [fc] ] 

16: end if 

17: end for 

18: FREE(*bit_flag) 

19: FREE(*product) 


We can see that once the heaviest workload, i.e., step 3, 
is parallelized through a fast segmented sum method de¬ 
scribed in [9i nsi I28| . nearly perfect load balance can be 
expected in all steps of Algorithm O However, in this con¬ 
text, the load balanced computation does not mean high 
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Figure 2: CSR-based SpMV using segmented sum. 


performance. Figure [3] shows that the row block method In 
cuSPARSE v6.5 can significantly outperform the segmented 
sum method in cuDPP v2.2 [T6]|28], while doing SpMV on 
relatively regular matrices (see Table [2] for the used bench¬ 
mark suite). 
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(segmented sum 
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■ CSR5-based SpMV 


Figure 3: Single precision SpMV performance. 


Why is this the case? We can see that the step 1 is a 
scatter operation and the step 4 is a gather operation, both 
from the row space of size m. This prevents the two steps 
from fusing with the steps 2 and 3 in the nonzero entry 
space of size nnz. In this case, more global synchroniza¬ 
tions and global memory accesses may degrade the overall 
performance. Previous research HEn] has found that the 
segmented sum may be more suitable for the COO (coordi¬ 
nate storage format) based SpMV, since the fully stored row 
index data can convert the steps 1 and 4 to the nonzero entry 
space: the bit_flag array can be generated by comparison 
of neighbor row indices, and the partial sums in the product 
array can be directly saved to y since their final locations 
are easily known from the row index array. Further, Yan et 
al. [35| and Tang et al. m reported that some variants of 
the COO format can also benefit from the segmented sum. 
However, it is well known that accessing row indices in the 
COO pattern brings higher off-chip memory pressure, which 
is just what the CSR format tries to avoid. 

In the following, we will show that the CSR5-based SpMV 
can utilize both the segmented sum for load balance and the 
compressed row data for better load/store efficiency. In this 
way, the CSR5-based SpMV can obtain up to 4x speedup 
(see Figure (21) over the CSR-based SpMV using the seg¬ 
mented sum primitive |28 |. 
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Figure 4: The CSR5 storage format of a sparse matrix A of size 8x8. The five groups of information include 
row_ptr, tile_ptr, col_idx, val and tile_desc. 


3. THE CSR5 STORAGE FORMAT 

3.1 Basic Data Layout 

To achieve near-optimal load balance for matrices with 
any sparsity structures, we first evenly partition all nonzero 
entries to multiple 2D tiles of the same size. Thus when ex¬ 
ecuting parallel SpMV operation, a compute core can con¬ 
sume one or more 2D tiles, and each SIMD lane of the core 
can deal with one column of a tile. Then the main skeleton 
of the CSR5 format is simply a group of 2D tiles. The CSR5 
format has two tuning parameters: oj and a, where w is a 
tile’s width and a is its height. In fact, the CSR5 format 
only has these two tuning parameters. 

Further, we need extra information to efficiently compute 
SpMV. For each tile, we introduce a tile pointer tile_ptr 
and a tile descriptor tile_desc. Meanwhile, the three ar¬ 
rays, i.e., row pointer row_ptr, column index col_idx and 
value val, of the classic CSR format are directly integrated. 
The only difference is that the col_idx data and the val 
data in each complete tile are in-place transposed (i.e., from 
row-major order to column-major order) for coalesced mem¬ 
ory access from contiguous SIMD lanes. If the last entries 
of the matrix do not fill up a complete 2D tile (i.e., nnz 
mod (cuct) 7^ 0), they just remain unchanged and discard 
their tile_desc. 

In Figure |4l an example matrix A of size 8x8 with 34 
nonzero entries is stored in the CSR5 format. When oj = 4 
and a — 4, the matrix is divided into three tiles including 
two complete tiles of size 16 and one incomplete tile of size 
2. The arrays col_idx and val in the two complete tiles are 
stored in tile-level column-major order now. Moreover, only 
the first two tiles have tile_desc, since they are complete. 

3.2 Auto-Tuned Parameters w and a 

Because the computational power of the modern multicore 
or manycore processors is mainly from the SIMD units, we 
design an auto-tuning strategy for high SIMD utilization. 

First, the tile width ui is set to the size of the SIMD exe¬ 
cution unit of the used processor. Then an SIMD unit can 
consume a 2D tile in a steps without any explicit synchro¬ 
nization, and the vector registers can be fully utilized. For 
the double precision SpMV, we always set a; = 4 for CPUs 
with 256-bit SIMD units, a; = 32 for the nVidia GPUs, 


cu = 64 for the AMD GPUs, and a; = 8 for Intel Xeon Phi 
with 512-bit SIMD units. Therefore, lu can be automatically 
decided once the processor type used is known. 

The other parameter a is decided by a slightly more com¬ 
plex process. For a given processor, we consider its on-chip 
memory strategy such as cache capacity and prefetching 
mechanism. If a 2D tile of size uj x a can empirically bring 
better performance than using the other sizes, the a is sim¬ 
ply chosen. We found that the x86 processors fall into this 
category. For the double precision SpMV on CPUs and Xeon 
Phi, we always set a to 16 and 12, respectively. 

As for GPUs, the tile height a further depends on the spar¬ 
sity of the matrix. Note that the “sparsity” is not equal to 
“sparsity structure”. We define “sparsity” to be the average 
number of nonzero entries per row (or nnz/row for short). In 
contrast, “sparsity structure” is much more complex because 
it includes 2D space layout of all nonzero entries. 

On GPUs, we have several performance considerations on 
mapping the value nnz/row to a. First, a should be large 
enough to expose more thread-level local work and to amor¬ 
tize a basic cost of the segmented sum algorithm. Second, 
it should not be too large since a larger tile potentially gen¬ 
erates more partial sums (i.e., entries to store to y), which 
bring higher pressure to last level cache write. Moreover, for 
the matrices with large nnz/row, a may need to be small. 
The reason is that once the whole tile is located inside a ma¬ 
trix row (i.e., only one segment is in the tile), the segmented 
sum converts to a fast reduction sum. 

Therefore, for the nnz/row to a mapping on GPUs, we 
dehne three simple bounds: r, s and t. The first bound r 
is designed to prevent a too small a. The second bound s 
is used for preventing a too large a. But when nnz/row is 
further larger than the third bound t, a is set to a small 
value u. Then we have 

if nnz/row < r 

if r < nnz/row < s 

if s < nnz/row < t 

if t < nnz/row. 

The three bounds, r, s and t, and the value u are hardware- 
dependent, meaning that for a given processor, they can 
be fixed for use. For example, to execute double precision 
SpMV on nVidia Maxwell GPUs and AMD GGN GPUs, we 
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always set <r,s,t,u> = <4, 32, 256, 4> and <4, 7,256, 4>, re¬ 
spectively. As for future processors with new architectures, 
we can obtain the four values through some simple bench¬ 
marks during initialization, and then use them for later runs. 
So the parameter a can be decided once the very basic in¬ 
formation of a matrix and a low-level hardware are known. 

Therefore, we can see that the parameter tuning time be¬ 
comes negligible because uj and cr are easily obtained. This 
can save a great deal of preprocessing time. 

3.3 Tile Pointer Information 

The added tile pointer information tile_ptr stores the 
row index of the first matrix row in each tile, indicating the 
starting position for storing its partial sums to the vector y. 
By introducing tile_ptr, each tile can find its own starting 
position, allowing tiles to execute in parallel. The size of 
the tile_ptr array is p -|- 1, where p — \nnz/{ui(jy\ is the 
number of tiles in the matrix. For the example in Figure 01 
the first entry of Tile 1 is located in the 4th row of the 
matrix, and thus 4 is set as its tile pointer. To build the 
array, we binary search the index of the first nonzero entry 
of each tile on the row_ptr array. Lines 1-4 in Algorithm 0] 
show this procedure. 

Recall that an empty row has exactly the same row pointer 
information as its first non-empty right neighbor row (see the 
second row in the matrix A in Figure [T]). Thus for the non¬ 
empty rows with an empty left neighbor, we need a specific 
process (which is similar to lines 12-16 in Algorithm [S]) to 
store their partial sums to correct positions in y. To recog¬ 
nize whether the specific process is required, we give a hint 
to the other parts (i.e., tile descriptor data) of the CSR5 
format and the CSR5-based SpMV algorithm. Here we set 
an entry in tile_ptr to its negative value, if its correspond¬ 
ing tile includes any empty rows. Lines 5-12 in Algorithm0] 
show this operation. 


Algorithm 4 Generating tile_ptr. 

1: for tid = 0 to p in parallel do 

2: bnd tid X oj X a 

3: tile_ptr[tid] BINARY_SEARCH(*row_ptr, bnd) —1 

4: end for 

5: for tid = 0 to p — 1 do 

6: for rid = tile_ptr[tid] to tile_ptr[tid -|- 1] do 

7: if row_ptr[rid] = row_ptr[rid-|- 1] then 

8: tile_ptr[tid] ■<— NEGATIVE(tile_ptr[tid]) 

9: break 

10: end if 

11: end for 

12: end for 


If the first tile has any empty rows, we need to store a —0 
(negative zero) for it. To record —0, here we use unsigned 
32- or 64-bit integer as data type of the tile_ptr array. 
Therefore, we have 1 bit for explicitly storing the sign and 
31 or 63 bits for an index. For example, in our design, tile 
pointer —0 is represented as a binary style TOGO ... 000’, 
and tile pointer 0 is stored as ‘0000 ... 000’. To the best 
of our knowledge, the index of 31 or 63 bits is completely 
compatible to most numerical libraries such as Intel MKL. 
Moreover, reference implementation of the recent high per¬ 
formance conjugate gradient (HPCG) benchmark [14] also 
uses 32-bit signed integer for problem dimension no more 
than 2®^ and 64-bit signed integer for problem dimension 


larger than that. Thus it is safe to save 1 bit as the empty 
row hint and the other 31 or 63 bits as a ‘real’ row index. 

3.4 Tile Descriptor Information 

Only having the tile pointer is not enough for a fast SpMV 
operation. For each tile, we also need four extra hints: (1) 
bit_flag of size lu x a, which indicates whether an entry 
is the first nonzero of a matrix row, (2) y_offset of size 
LO used to further let each column know where the starting 
point to store its local partial sums is, (3) seg_off set of size 
cj used to accelerate the local segmented sum inside a tile, 
and (4) empty_of f set of unfixed size (but no longer than 
oj X a) constructed to help the partial sums to find correct 
locations in y if the tile includes any empty rows. The tile 
descriptor tile_desc is defined to denote a combination of 
the above four groups of data. 

Generating bit_f lag is straightforward. The procedure is 
very similar to lines 3-5 in Algorithm|3| The main difference 
is that the bit flags are saved in column-major order, which 
matches the in-place transposed col_idx and val. Addi¬ 
tionally, the first entry of each tile’s bit_f lag is set to TRUE 
for sealing the first segment from the top and letting 2D tiles 
to be independent from each other. 

The array y_offset of size oj is used to help the columns 
in each tile knowing where the starting points to store their 
partial sums to y are. In other words, each column has one 
entry in the array y_offset as a starting point offset for 
all segments in the same column. We save a row index off¬ 
set (i.e., relative row index) for each column in y_offset. 
Thus for the ith column in the tidth tile, by calculating 
tile_ptr [tid] + y_offset[i], the column knows where its 
own starting position in y is. Thus the columns can work in a 
high degree of parallelism without waiting for a synchroniza¬ 
tion. Generating y_offset is simple: each column counts 
the number of TRUEs in its previous columns’ bit_flag ar¬ 
ray. Gonsider Tile 1 in Figure 0] as an example: because 
there are 3 TRUEs in the 1st column, the 2nd column’s corre¬ 
sponding value in y_off set is 3. In addition, since there are 
in total 4 TRUEs in the 1st, 2nd and 3rd columns’ bit_f lag. 
Tile I’s y_offset[3]= 4. Algorithm [5] lists how to generate 
y_offset for a single 2D tile in an SIMD-friendly way. 


Algorithm 5 Generating y_offset and seg_offset. 
1: MALLOC(*tmp_bit, oo) 

2: MEMSET(*tmp_bit, FALSE) 

3: for i = 0 to w — 1 in parallel do 
4: y_offset[i] 0 

5: for j = 0 to (T — 1 do 

6: y_offset[i] 4— y_offset[i] -f bit_flag[i][j] 

7: tmp_bit[i] 4— tmp_bit[i] V bit_f lag[i] [ji] 

8: end for 

9: seg_offset[i] 4— 1— tmp_bit[i] 

10: end for 

11: EXCLUSIVE_PREFIX_SUM_SCAN(*y_offset) 

12: SEGMENTED_SUM(*seg_offset, *tmp_bit) 

13: FREE(*tmp_bit) 


The third array seg_offset of size oj is used for acceler¬ 
ating a local segmented sum in the workload of each tile. 
The local segmented sum is an essential step that synchro¬ 
nizes partial sums in a 2D tile (imagine multiple columns 
in the tile come from the same matrix row). In the previ¬ 
ous segmented sum (or segmented scan) method l5]|9]|28i 









I15| . the local segmented sum is complex and not efficient 
enough. Thus we prepare seg_offset as an auxiliary array 
to facilitate implementation of segmented sum by way of 
the prefix-sum scan, which is a well optimized fundamental 
primitive for the SIMD units. 

To generate seg_offset, we let each column search its 
right neighbor columns and count the number of contiguous 
columns without any TRUEs in their bit_flag. Using Tile 0 
in Figure [4] as an example, its 2nd column has one and only 
one right neighbor column (the 3rd column) without any 
TRUEs in its bit_flag. Thus the 2nd column’s seg_offset 
value is 1. In contrast, because the other three columns (the 
1st, 3rd and 4th) do not have any ‘all FALSE’ right neighbors, 
their values in seg_offset is 0. Algorithm [5] shows how to 
generate seg_offset using an SIMD-friendly method. 

Algorithm [6] and Figure [5] show the fast segmented sum 
using seg_offset and an inclusive prefix-sum scan. The 
principle of this operation is that the prefix-sum scan is es¬ 
sentially an increment operation. Once a segment knows the 
distance (i.e., offset) between its head and its tail, its partial 
sum can be deduced from its prefix-sum scan results. There¬ 
fore, the more complex segmented sum operation in [3IS1I1H1 
I15| can be converted to a faster prefix-sum scan operation 
(line 5) and a few arithmetic operations (lines 6-8). 


Algorithm 6 Fast segmented sum using seg_offset. 

1; function FAST_SEGMENTED_SUM(*in, *seg_of f set) 

2: length SlZEOF(*in) 

3: MALLOC(*tmp, length) 

4: MEMCPY(*tmp, *in) 

5: INCLUSIVE_PREFIX_SUM_SCAN(*in) 

6: for i = 0 to length — 1 in parallel do 

7: in[i] 4—in[i-|-seg_offset [i]] — in[i] +tmp[i] 

8: end for 

9: FREE(*tmp) 

10: end function 


seg_offset[] = 
in[] = 
tmp(=in)[] = 
in(scanned)[] = 

in(seg_sum)[] = 

Figure 5: An example of the fast segmented sum. 

The last array empty_of f set occurs when and only when 
a 2D tile includes any empty rows (i.e., its tile pointer is 
negative). Because an empty row of a matrix has the same 
row pointer with its rightmost non-empty neighbor row (re¬ 
call the second row in the matrix A in Figure 1), y_offset 
will record an incorrect offset for it. We correct for this by 
storing correct offsets for segments within a tile. Thus the 
length of empty_of f set is the number of segments (i.e., the 
total number of TRUEs in bit_flag) in a tile. For example. 
Tile 0 in Figure 0] has 4 entries in its empty_off set since 
its bit_flag includes 4 TRUEs. Algorithm 0 lists the pseu¬ 
docode that generates empty_of f set for a tile that contains 
at least one empty row. 
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Algorithm 7 Generating empty_of f set for the tidt\i tile. 
1: length •<— REDUCTlON_SUM(*bit_flag) 

2: MALLOC(*empty_offset, length) 

3: eid •<— 0 

4: for j = 0 to cj — 1 do 
5: for j = 0 to (T — 1 do 

6: if bit_f lag [i] [j] = TRUE then 

7: ptr -G- tid xcuxa + ixa+j 

8: idx •<— BINARY_SEARCH(*row_ptr, ptr) —1 

9: idx idx — REMOVE_SIGN(tile_ptr [tid] ) 

10: empty_off set [eid] idx 

11: eid •<— eid -|- 1 

12: end if 

13: end for 

14: end for 


3.5 Storage Details 

To store the tile_desc arrays in a space-efficient way, 
we find upper bounds to the entries and utilize the bit-field 
pattern. First, since entries in y_offset store offset dis¬ 
tances inside a 2D tile, they have an upper bound of oja. So 
[log 2 (wcr)] bits are enough for each entry in y_offset. For 
example, when w = 32 and a = 16, 9 bits are enough for 
each entry. Second, since seg_offset includes offsets less 
than to, [log 2 (a;)] bits are enough for an entry in this array. 
For example, when to = 32, 5 bits are enough for each entry. 
Third, bit_f lag stores a 1-bit flags for each column of a 2D 
tile. When a = 16, each column needs 16 bits. So 30 (i.e., 
9 -f 5 -f 16) bits are enough for each column in the exam¬ 
ple. Therefore, for a tile, the three arrays can be stored in a 
compact bit-field composed of u 32-bit unsigned integers. If 
the above example matrix has 32-bit integer row index and 
64-bit double precision values, only around 2% extra space 
is required by the three newly added arrays. 

The size of empty_of f set depends on the number of groups 
of contiguous empty rows, since we only record one offset 
for the rightmost non-empty row with any number of empty 
rows as its left neighbors. 

3.6 The CSR5 for Other Matrix Operations 

Since we in-place transposed the CSR arrays col_idx and 
val, a conversion from the CSR5 to the CSR is required for 
doing other sparse matrix operations using the CSR format. 
This conversion is simply removing tile_ptr and tile_desc 
and transposing col_idx and val back to row-major order. 
Thus the conversion can be very fast. Further, since the 
CSR5 is a superset of the CSR, any entry accesses or slight 
changes can be done directly in the CSR5 format, without 
any need to convert it to the CSR format. Additionally, 
some applications such as finite element methods can di¬ 
rectly assemble sparse matrices in the CSR5 format from 
data sources. 


4. THE CSR5-BASED SPMV ALGORITHM 

Because all computations of the information (tile_ptr, 
tile_desc, col_idx and val) of 2D tiles are independent 
of each other, they can execute concurrently. On GPUs, 
we assign a bunch of threads (i.e., warp in nVidia GPUs 
or wavefront in AMD GPUs) for each tile. On CPUs and 
Xeon Phi, we use OpenMP pragma for assigning the tiles to 
available x86 cores. Furthermore, the columns inside a tile 

































































are independent of each other as well. So we assign a thread 
on GPU cores or an SIMD lane on x86 cores to each column 
in a tile. 

While running the CSR5-based SpMV, each column in a 
tile can extract information from bit_flag and label the 
segments in its local data to three colors: (1) red means a 
sub-segment unsealed from its top, (2) green means a com¬ 
pletely sealed segment existed in the middle, and (3) blue 
means a sub-segment unsealed from its bottom. There is an 
exception that if a column is unsealed both from its top and 
from its bottom, it is colored to red. 

Algorithm |5] shows the pseudocode of the CSR5-based 
SpMV algorithm. Figure |6] plots an example of this pro¬ 
cedure. We can see that the green segments can directly 
save their partial sums to y without any synchronization, 
since the indices can be calculated by using tile_ptr and 
y_offset. In contrast, the red and the blue sub-segments 
have to further add their partial sums together, since they 
are not complete segments. For example, the sub-segments 
B 2 , R 2 and Rs in Figure [S] have contributions to the same 
row, thus an addition is required. This addition operation 
needs the fast segmented sum shown in Algorithm [S] and 
Figure [5] Furthermore, if a tile has any empty rows, the 
empty_off set array is accessed to get correct global indices 
in y. 



Figure 6: The CSR5-based SpMV in a tile. Partial 
sums of the green segments are directly stored to y. 
The red and the blue sub-segments require an extra 
segmented snm before issuing off-chip write. 

Consider the synchronization among the tiles, since the 
same matrix row can be influenced by multiple 2D tiles run¬ 
ning concurrently, the first and the last segments of a tile 
need to store to y by atomic add (or a global auxiliary array 
used in device-level reduction, scan or segmented scan [13 
I28p. In Figure (3 the atomic add operations are highlighted 
by arrow lines with plus signs. 

For the last entries not in a complete tile (e.g., the last 
two nonzero entries of the matrix in Figure lU, we execute 
a conventional CSR-vector method after all of the complete 
2D tiles have been consumed. Note that even though the 
last tile (i.e., the incomplete one) does not have tile_desc 
arrays, it can extract a starting position from tile_ptr. 

In Algorithm (3 we can see that the main computation 
(lines 5-21) only contains very basic arithmetic and logic 
operations that can be easily programmed on all mainstream 
processors with SIMD units. As the most complex part in 
our algorithm, the fast segmented sum operation (line 22) 


Algorithm 8 The CSR5-based SpMV for the tidth. tile. 

1: MALLOC(*tmp, u)) 

2: MEMSET(*tmp, 0) 

3: MALLOC(*last_tmp, lu) 

4: /*use empty_of f set [y_off set [i] ] instead of 
y_offset[i] for a tile with any empty rows*/ 

5: for i = 0 to ai — 1 in parallel do 

6: sum t— 0 

7; for j = 0 to (7 — 1 do 

8: ptr •<— tid xuixa + jxui + i 

9: sum t— sum + val [ptr] X x [col_idx [ptr] ] 

10: /*check bit_f lag [i] [j] */ 

11: if /*end of a red sub-segment*/ then 

12: tmp[i — !]■<— sum 

13: sum 0 

14: else if /*end of a green segment*/ then 

15: y [tile_ptr [tid] + y_offset[i]] sum 

16: y_offset[i] y_of f set [i] -fl 

17: sum 0 

18: end if 

19: end for 

20: last_tmp[i] sum//end of a blue sub-segment 

21: end for 

22: FAST_SEGMENTED_SUM(*tmp, *seg_of f set) 0 Alg. |3 

23: for i = 0 to tj — 1 in parallel do 

24: last_tmp[i] last_tmp [i] 4-tmp[i] 

25: y [tile_ptr [tid] + y_offset[i]] ■«— last_tmp [i] 

26: end for 
27: FREE(*tmp) 

28: FREE(*last_tmp) 


only requires a prefix-sum scan, which has been well-studied 
and can be efficiently implemented by using CUDA, OpenCL 
or x86 SIMD intrinsics. 


5. EXPERIMENTAL RESULTS 
5.1 Experimental Setup 

We evaluate the CSR5-based SpMV and 11 state-of-the- 
art formats and algorithms on four mainstream platforms: 
dual-socket Intel CPUs, an nVidia GPU, an AMD GPU and 
an Intel Xeon Phi. The platforms and participating ap¬ 
proaches are shown in Table [T] 

Host of the two GPUs is a machine with AMD A10-7850K 
APU, dual-channel DDR3-1600 memory and 64-bit Ubuntu 
Linux vl4.04 installed. Host of the Xeon Phi is a machine 
with Intel Xeon E5-2680 v2 CPU, quad-channel DDR3-1600 
memory and 64-bit Red Hat Enterprise Linux v6.5 installed. 
The two GPU platforms use the g-l—I- compiler v4.8.2. The 
two Intel machines always set the Intel C/C-I--I- compiler 
15.0.1 as default. 

Here we evaluate double precision SpMV. So cuDPP li¬ 
brary [161 El], clSpMV [29] and yaSpMV [34] are not in¬ 
cluded since they only support single precision floating point 
as data type. Two recently published methods |20l |30| are 
not tested since the source code is not available to us yet. 

We use OpenCL profiling scheme for timing SpMV on the 
AMD platform and record wall-clock time on the other three 
platforms. For all participating formats and algorithms, we 
evaluate SpMV 10 times (each time contains 1000 runs and 
records the average) and report the best observed result. 






















































The testbeds 

The participating formats and algorithms 

Dual-socket Intel Xeon E5-2667 v3 

(Haswell, 2x8 cores @ 3.2 GHz, 1.64 SP 
TFlops, 819.2 DP GFlops, 64 GB 

GDDR4, ECG-on, 2x68.3 GB/s 
bandwidth). 

(1) The CSR-based SpMV in Intel MKL 11.2 Update 1. 

(2) BiGSB vl.2 using GSB [7] with bitmasked register block [^. 

(3) pOSKI vl.0.0 [S] using OSKI vl.O.lh |311132| kernels. 

(4) The CSR5-based SpMV implemented by using OpenMP and AVX2 intrinsics. 

An nVidia GeForce GTX 980 

(Maxwell GM204, 2048 CUDA cores @! 

1.13 GHz, 4.61 SP TFlops, 144.1 DP 
GFlops, 4 GB GDDR5, 224 GB/s 
bandwidth, driver v344.16). 

(1) The best CSR-based SoMV 141 from cuSPARSE v6.5 and GUSP vO.4.0 llll. 

(2) The best HYB [J| from the above two libraries. 

(3) BRC [2] with texture cache enabled. 

(4) ACSR ^ with texture cache enabled. 

(5) The CSR5-based SpMV implemented by using CUDA v6.5. 

An AMD Radeon R9 290X (GCN 
Hawaii, 2816 Radeon cores @ 1.05 GHz, 
5.91 SP TFlops, 739.2 DP GFlops, 4 GB 
GDDR5, 345.6 GB/s bandwidth, driver 
vl4.41). 

(1) The CSR-vector method [4] extracted from CUSP vO.4.0 llll. 

(2) The GSR-Adaptive algorithm [T^ implemented in ViennaCL vl.6.2 |26|. 

(3) The CSR5-based SpMV implemented by using OpenCL vl.2. 

An Intel Xeon Phi SllOp (Knights 
Corner, 60 x86 cores ® 1.05 GHz, 2.02 SP 
TFlops, 1.01 DP TFlops, 8 GB GDDR5, 
ECC-on, 320 GB/s bandwidth, driver 
V3.4-1, ^OS v2.6.38.8). 

(1) The CSR-based SpMV in Intel MKL 11.2 Update 1. 

(2) The ESB [24] with dynamic scheduling enabled. 

(3) The CSR5-based SpMV implemented by using OpenMP and MIC-KNC intrinsics. 


Table 1: The testbeds and participating formats and algorithms. 


5.2 Benchmark Suite 

In Table [2] we list 24 sparse matrices as our benchmark 
suite for all platforms. The first 20 matrices have been 
widely adopted in previous SpMV research [21 H El EH 
12911331134| . The other 4 matrices are chosen since they have 
more diverse sparsity structures. All matrices except Dense 
are downloadable at the University of Florida Sparse Matrix 
Collection m- 

To achieve a high degree of differentiation, we categorize 
the 24 matrices in TableElinto two groups: (1) regular group 
with the upper 14 matrices, (2) irregular group with the 
lower 10 matrices. This classification is mainly based on 
the minimum, average and maximum lengths of the rows. 
Matrix dc2 is a representative of the group of irregular ma¬ 
trices. Its longest single row contains 114K nonzero entries, 
i.e., 15% nonzero entries of the whole matrix with 117K 
rows. This sparsity pattern challenges the design of efficient 
storage format and SpMV algorithm. 


Id 

Name 

Dimensions 

nnz 

nnz per row 
(min, avg, max) 

rl 

Dense 

2Kx2K 

4.0M 

2K, 2K, 2K 

r2 

Protein 

36KX36K 

4.3M 

18, 119, 204 

r3 

FEM/Spheres 

83KX83K 

6.0M 

1, 72, 81 

r4 

FEM / Cantilever 

62KX62K 

4.0M 

1, 64, 78 

r5 

Wind Tunnel 

218KX218K 

11.6M 

2, 53, 180 

r6 

QCD 

49KX49K 

1.9M 

39, 39, 39 

r7 

Epidemiology 

526KX526K 

2.1M 

2, 3, 4 

r8 

FEM/Harbor 

47KX47K 

2.4M 

4, 50, 145 

r9 

FEM/Ship 

141KX141K 

7.8M 

24, 55, 102 

rlO 

Economics 

207KX207K 

1.3M 

1, 6, 44 

rll 

FEM/Accelerator 

121KX121K 

2.6M 

0, 21, 81 

rl2 

Circuit 

171KX171K 

959K 

1, 5, 353 

rl3 

Ga41As41H72 

268KX268K 

1S.5M 

18, 68, 702 

rll 

Si41Ge41H72 

186KX186K 

15.OM 

13, 80, 662 

il 

Webbase 

IMxlM 

3.1M 

1, 3, 4.7K 

i2 

LP 

IKxl.lM 

11.3M 

1, 2.6K, 56.2K 

i3 

Circuit5M 

5.6MX5.6M 

59.5M 

1, 10, 1.29M 

i4 

eu-2005 

S63KX863K 

19.2M 

0, 22, 6.9K 

i5 

in-2004 

1.4MX1.4M 

16.9M 

0, 12, 7.8K 

i6 

mipl 

66KX66K 

10.4M 

4, 155, 66.4K 

i7 

ASIC_680k 

683KX683K 

3.9M 

1, 6, 395K 

i8 

dc2 

117KX117K 

766K 

1, 7, 114K 

i9 

FullChip 

2.9MX2.9M 

26.6M 

1, 9, 2.3M 

ilO 

ins2 

309KX309K 

2.8M 

5, 9, 309K 


Table 2: The benchmark suite. 


5.3 Isolated SpMV Performance 

Figure [71 shows double precision SpMV performance of the 
14 regular matrices on the four platforms. We can see that, 
on average, all participating algorithms deliver comparable 
performance. On the CPU platform, Intel MKL obtains the 
best performance on average and the other 3 methods be¬ 
have similar. On the nVidia GPU, the CSR5 delivers the 
highest throughput. The ACSR format is slower than the 
others, because its binning strategy leads to non-coalesced 
memory access. On the AMD GPU, the CSR5 achieves the 
best performance. Although the dynamic assigning in the 
CSR-Adaptive method can obtain better scalability than the 
CSR-vector method, it still cannot achieve near perfect load 
balance. On the Xeon Phi, the CSR5 is slower than Intel 
MKL and the ESB format. The main reason is that the cur¬ 
rent generation of Xeon Phi can only issue up to 4 relatively 
slow threads per core (i.e., up to 4x 60 threads in total on the 
used device), and thus the latency of gathering entries from 
vector X becomes the main bottleneck. Then reordering or 
partitioning nonzero entries based on the column index for 
better cache locality behaves well in the ESB-based SpMV. 
However, in Section 5.6 we will show that this strategy leads 
to very high preprocessing cost. 

Figure [8] shows double precision SpMV performance of the 
10 irregular matrices. We can see that the irregularity can 
dramatically impact SpMV throughput of some approaches. 
On the CPU platform, the row block method based Intel 
MKL is now slower than the other methods. The CSR5 
outperforms the others because of better SIMD efficiency 
from the AVX2 intrinsics. On the nVidia GPU, the CSR5 
brings the best performance because of the near perfect load 
balance. The other two irregularity-oriented formats, HYB 
and ACSR, behave well but still suffer from imbalanced work 
decomposition. Note that the ACSR format is based on Dy¬ 
namic Parallelism, a technical feature only available on re¬ 
cently released nVidia GPUs. On the AMD GPU, the CSR5 
greatly outperforms the other two algorithms using the row 
block methods. Because the minimum work unit of the CSR- 
Adaptive method is one row, the method delivers degraded 
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Figure 7: The SpMV performance of the 14 regular matrices. (nGPU=nVidia GPU, aGPU=AMD GPU) 
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Figure 8: The SpMV performance of the 10 irregular matrices. (nGPU=nVidia GPU, aGPU=AMD GPU) 


performance for matrices -with very long row^. On the Xeon 
Phi, the CSR5 can greatly outperform the other two meth¬ 
ods in particular when matrices are too irregular to expose 
cache locality of x by the ESB format. Furthermore, since 


®Note that we use an implementation of the CSR-Adaptive 
from the ViennaCL Library. The AMD’s version of the CSR- 
Adaptive, which is not available to us yet, may have slightly 
different performance. 


ESB is designed on top of the ELLPACK format, it cannot 
obtain the best performance for some irregular matrices. 

Overall, the CSR5 achieves better performance (on the 
two GPU devices) or comparable performance (on the two 
x86 devices) for the 14 regular matrices. For the 10 irregular 
matrices, compared to pOSKI, AGSR, GSR-Adaptive and 
ESB as the second best methods, the CSR5 obtains average 
performance gain of 17.6%, 28.5%, 173.0% and 293.3% (up 
to 213.3%, 153.6%, 405.1% and 943.3%), respectively. 





















































































































































































































































































































Benchmark 

The 14 regular matrices 

The 10 irregular matrices 

Metrics 

Preprocessing 
to SpMV 
ratio 

Speedup 
of #iter.=50 

Speedup 
of #iter.=500 

Preprocessing 
to SpMV 
ratio 

Speedup 
of #iter.=50 

Speedup 
of 7^iter.=500 

avg 

best 

avg 

best 

avg 

best 

avg 

best 

CPU-BiCSB 

538.01X 

0.06x 

0.1 lx 

0.35x 

o.eox 

331.77X 

0.13x 

0.24x 

o.eox 

1.07x 

CPU-pOSKI 

12.30x 

0.43x 

0.88x 

0.57x 

0.99x 

10.71X 

0.62x 

1.66x 

0.83x 

2.43x 

CPU-CSR5 

6.14X 

0.52x 

0.74x 

0.59x 

0.96x 

3.69x 

0.91x 

2.37x 

1.03x 

2.93x 

nGPU-HYB 

13.73X 

0.73x 

0.98x 

0.92x 

1.21x 

28.59x 

1.86x 

13.61x 

2.77x 

25.57X 

nGPU-BRC 

151.21x 

0.26x 

0.31x 

O.SOx 

0.98x 

51.85X 

1.17x 

7.60x 

2.49x 

15.47x 

nGPU-ACSR 

l.lOx 

0.68x 

0.93x 

0.72x 

1.03x 

3.04x 

5.05x 

41.47x 

5.41x 

51.95X 

nGPU-GSR5 

3.06x 

1.04x 

1.34x 

l.lOx 

1.45x 

1.99x 

6.43x 

48.37x 

6.77x 

52.31X 

aGPU-CSR-Adaptive 

2.68x 

l.OOx 

1.33x 

1.07x 

1.48x 

1.16x 

3.02x 

27.88X 

S.llx 

28.22x 

aGPU-GSR5 

4.99x 

1.04x 

1.39x 

1.14x 

1.51x 

S.lOx 

5.72x 

135.32X 

6.04X 

141.94X 

Phi-ESB 

922.47X 

0.05x 

0.15x 

0.33x 

0.88x 

222.19X 

0.27x 

1.15x 

1.30x 

2.96x 

Phi-GSR5 

11.52X 

0.54x 

1.14x 

0.65x 

1.39x 

9.45x 

3.43x 

18.48X 

4.10X 

21.18X 


Table 3: Preprocessing cost and its impact on the iteration-based scenarios. 


5.4 Effects of Auto-Tuning 

In section 3.2, we discussed a simple auto-tuning scheme 
for the parameter cr on GPUs. Figure [9] shows its effects (the 
X axis is the matrix ids). We can see that compared to the 
best performance chosen from a range of a = 4 to 48, the 
auto-tuned cr does not have obvious performance loss. On 
the nVidia GPU, the performance loss is on average -4.2%. 
On the AMD GPU, the value is on average -2.5%. 
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(b) The AMD R9-290X GPU. 


Figure 9: Auto-tuning effects on the two GPUs. 


5.5 Format Conversion Cost 

The format conversion from the GSR to the GSRS includes 
four steps: (1) memory allocation, (2) generating tile_ptr, 
(3) generating tile_desc, and (4) transposition of col_idx 
and val arrays. Figure fTOl shows the cost of the four steps 
for the 24 matrices (the x axis is the matrix ids) on the four 
used platforms. Cost of one single SpMV operation is used 
for normalizing format conversion cost on each platform. We 
can see that the conversion cost can be on average as low as 
the overhead of a few SpMV operations on the two GPUs. 
On the two x86 platforms, the conversion time is longer (up 
to cost of around 10-20 SpMV operations). The reason is 
that the conversion code is manually SIMDized using CUBA 


or OpenCL on GPUs, but only auto-parallelized by OpeuMP 
on x86 processors. 
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(a) The CPU. (b) The nVidia GPU. 



(c) The AMD GPU. (d) The Xeon Phi. 

Figure 10: The normalized format conversion cost. 


5.6 Iteration-Based Scenarios 

Since both the preprocessing (i.e., format conversion from 
a basic format) time and the SpMV time are important for 
real-world applications, we have designed an iteration-based 
benchmark. This benchmark measures the overall perfor¬ 
mance of a solver with n iterations. We assume the in¬ 
put matrix is already stored in the GSR format. So the 
overall cost of using the GSR format for the scenarios is 
nTspmv, where is execution time of one GSR-based 

SpMV operation. For a new format, the overall cost is 
Tpr^ + , where Tp^™ is preprocessing time and the 

T^pmv is one SpMV time using the new format. Thus we 
can calculate speedup of a new format over the GSR format 
in the scenarios, through (nT^p^^)/(Tp^^ +nT^pmv)- 

Table |3] shows the new formats’ preprocessing cost (i.e., 
Tpr™/T^pmv) and their speedups over the GSR format in 
the iteration-based scenarios when n = 50 and n = 500. 
The emboldened font in the table shows the highest positive 
speedups on each platform. The compared baseline is the 
fastest GSR-based SpMV implementation (i.e., Intel MKL, 



























































































nVidia cuSPARSE/CUSP, CSR-vector from CUSP, and In¬ 
tel MKL, respectively) on each platform. We can see that 
because of the very low preprocessing overhead, the CSR5 
can further outperform the previous methods when doing 
50 iterations and 500 iterations. Although two GPU meth¬ 
ods, the ACSR format and the CSR-Adaptive approach, in 
general have shorter preprocessing time, they suffer from 
lower SpMV performance and thus cannot obtain the best 
speedups. On all platforms, the CSR5 always achieves the 
highest overall speedups. Moreover, the CSR5 is the only 
format that obtains higher performance than the CSR for¬ 
mat when only 50 iterations are required. 

6. RELATED WORK 

A great deal of work has been published on accelerating 
the SpMV operation. The block-based sparse matrix 
construction has received most attention [3 n 0 na uni 
EH EU because of two main reasons: (1) sparse matrices 
generated by some real-world problems (e.g., finite element 
discretization) naturally have the block sub-structures, and 

(2) off-chip load operations may be decreased by using the 
block indices instead of the entry indices. However, for many 
matrices that do not exhibit a natural block structure, trying 
to extract the block information is time consuming and has 
limited effects. 

On the other hand, the hybrid formats I29|. such as 
HYB, have been designed for irregular matrices. However, 
higher kernel launch overhead and invalidated cache among 
kernel launches tend to decrease their overall performance. 
Moreover, it is hard to guarantee that every sub-matrix can 
saturate the whole device. In addition, some relatively sim¬ 
ple operations such as solving triangular systems become 
complex while the input matrix is stored in two or more 
separate parts. 

The recent row block methods showed good perfor¬ 
mance either for regular matrices m or for irregular matri¬ 
ces [1], but not for both. In contrast, the CSR5 can deliver 
higher throughput both for regular matrices and for irregu¬ 
lar matrices. 

The segmented sum methods have been used in two 
recently published papers |301134| for the SpMV on either 
GPUs or Xeon Phi. However, both of them need to store 
the matrix in GOO-like formats to utilize the segmented 
sum. In contrast, the GSR5 format saves useful row index 
information in a compact way, and thus can be more efficient 
both for the format conversion and for the SpMV operation. 

Sedaghati et al. EZ] constructed machine learning clas¬ 
sifiers for automatic selection of the best format for 
a given sparse matrix on a target GPU. The GSR5 format 
described in this work can further simplify such a selection 
process because it is insensitive to the sparsity structure of 
the input sparse matrix. 

Moreover, to the best of our knowledge, the GSR5 is the 
only format that supports high throughput cross-platform 
SpMV on GPUs, nVidia GPUs, AMD GPUs and Xeon Phi 
at the same time. This advantage may simplify the devel¬ 
opment of scientific software for processors with massive on- 
chip parallelism. 

7. CONCLUSIONS 

In this paper, we proposed the GSR5 format for efficient 
cross-platform SpMV on GPUs, GPUs and Xeon Phi. The 


format conversion from the GSR to the GSR5 was very fast 
because of the format’s insensitivity to sparsity structure of 
the input matrix. The GSR5-based SpMV was implemented 
by a redesigned segmented sum algorithm with higher SIMD 
utilization compared to the classic methods. The experimen¬ 
tal results showed that the GSR5 delivered high throughput 
both in the isolated SpMV tests and in the iteration-based 
scenarios. 
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